Week 9 of 12 · Part B — Alignment Literacy

The Oversight Toolkit

Locking in Week 9 — and closing Part B with one coherent argument

Day 45 ~50 minutes Review

Day 45 of 60

What you now hold

You can now name five alignment techniques and the tradeoff each makes. RLHF trains on human preference labels, but human labeling gets expensive and unreliable once tasks exceed what raters can judge. RLAIF / Constitutional AI moves labeling from humans to an AI model guided by a written, auditable constitution. Debate lets a weak judge supervise a hard question through adversarial structure. Weak-to-strong generalization reframes alignment as eliciting a strong model's latent capability under weak supervision. Process supervision rewards correct reasoning steps, not just correct final answers. That's the working toolkit of scalable oversight.
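To make the last distinction concrete, here is a minimal sketch of outcome versus process scoring on a toy arithmetic chain. It is an illustration under stated assumptions, not how any particular lab implements process supervision: the `Step` type, the `is_valid` labels, and the "fraction of valid steps" aggregation are all hypothetical simplifications chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    is_valid: bool  # hypothetical per-step label (from a human or a verifier model)


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome supervision: score only whether the final answer matches."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_reward(steps: list[Step]) -> float:
    """Process supervision (simplified): fraction of steps labeled valid."""
    if not steps:
        return 0.0
    return sum(s.is_valid for s in steps) / len(steps)


# Toy chain for 12 * 15: two arithmetic errors cancel out,
# so the final answer is right even though the reasoning is not.
chain = [
    Step("12 * 15 = 12 * 10 + 12 * 5", True),
    Step("12 * 10 = 120", True),
    Step("12 * 5 = 50", False),      # should be 60
    Step("120 + 50 = 180", False),   # should be 170
]

outcome = outcome_reward("180", "180")
process = process_reward(chain)
print(f"outcome reward: {outcome}, process reward: {process}")
```

The chain gets full credit under outcome scoring and only half credit under step-level scoring, which is the gap process supervision is meant to expose. The cost is the tradeoff named above: someone, or something, has to supply a label for every step.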

The through-line of Week 9

Every technique is one answer to a single question: how do you get reliable supervision when the model outgrows its supervisor? The honest verdict across all of them is that none dominates on scalability, human cost, and deception-robustness at once. Maturity in this field means holding the promise and the limit of each in the same hand.
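"Dominates" has a precise meaning here: at least as good on every axis and strictly better on at least one. A short sketch of that check follows. The ratings in it are placeholders invented purely to exercise the code, not measurements or claims about the techniques, and the names `dominates`, `dominant_techniques`, and `PLACEHOLDER_RATINGS` are hypothetical.

```python
from typing import Dict, List, Tuple

# (scalability, low human cost, deception-robustness); higher is better on each axis.
Ratings = Tuple[int, int, int]


def dominates(a: Ratings, b: Ratings) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def dominant_techniques(ratings: Dict[str, Ratings]) -> List[str]:
    """Return the techniques that dominate every other technique in the table."""
    return [
        name
        for name, r in ratings.items()
        if all(dominates(r, other) for o, other in ratings.items() if o != name)
    ]


# Placeholder ordinal ratings, invented for illustration only (NOT empirical data).
PLACEHOLDER_RATINGS: Dict[str, Ratings] = {
    "RLHF": (1, 1, 2),
    "RLAIF / Constitutional AI": (3, 3, 1),
    "Debate": (2, 2, 3),
    "Weak-to-strong": (3, 2, 1),
    "Process supervision": (2, 1, 3),
}

print(dominant_techniques(PLACEHOLDER_RATINGS))
```

With these placeholder numbers the list comes back empty because each technique loses to some other on at least one axis. The point is the shape of the argument rather than the numbers: to claim a winner you would need one row that is never beaten on any column.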

Closing Part B: one argument, three layers

The Synthesis

Diagnosis → emerging risk → verification → solution

Part B is one connected story. Alignment (Week 6) explained why capable optimizers can pursue the wrong goal. Deception (Week 7) showed that failure can be hidden and survive safety training. Interpretability (Week 8) is the bet on reading internals to catch what behavior hides. Oversight (this week) is the engineering response: techniques to supervise systems we can't fully check.

The loop back to Part A

Oversight isn't separate from the applied work you did in Part A — it's what makes it trustworthy. Better supervision produces models whose behavior your evals and red-teams can actually rely on. The frontier and the front line are the same fight from two ends.

Why "no technique dominates" is the mature take

The amateur wants a winner. The professional reports the real state of the field: a toolkit of partial, complementary methods you stack by context, each with an honest limit. Carrying that nuance — promise and caveat, every time — is what makes your alignment literacy credible rather than performative.

Self-quiz — can you do these without notes?

Prove the Week

Name all five techniques and one tradeoff each — without looking.
Distinguish RLHF from RLAIF, and explain what "scalable oversight" means in one sentence.
Explain the assumption debate rests on, and one situation where it fails.
State the weak-to-strong result and its strongest limitation, then explain why alignment is partly an elicitation problem.
Finalize your CAI → debate → weak-to-strong → process-supervision comparison, write your Week 9 summary, and note how oversight makes Part A's evals trustworthy.

The expert move

A practitioner can list the oversight techniques. An expert frames them as one toolkit answering one question — reliable supervision past the supervisor's limit — and closes the loop: oversight is what lets the evals and red-teams of Part A be trusted at all. The altitude jump is from cataloguing methods to arguing how diagnosis, deception, interpretability, and oversight compose into a single account of why a system is safe enough to deploy.

Say this in an interview: "Scalable oversight is the engineering end of alignment — RLAIF, debate, weak-to-strong, and process supervision are partial, complementary answers to supervising systems we can't fully check, and none dominates, so you stack them by context. And it's not separate from applied safety: better oversight is what makes the evals and red-teams I'd run actually trustworthy."

Week 9 Takeaways

Five techniques, one question: reliable supervision past the supervisor's limit.
The honest verdict: no technique dominates on scalability, human cost, and deception-robustness at once, so you stack them.
Part B is one argument: alignment → deception → interpretability → oversight, and it loops back to make Part A's evals trustworthy.
Next: Part C — governance, where these claims become auditable risk registers and model cards.